Data Visualization with R

library(dplyr)
library(ggplot2)
library(qdata)
data(bands)

Scatter plots display the relationship between two continuous variables. Each axis represents one variable, and each point represents one observation. A scatter plot is often the first plot you draw when exploring a dataset.

This chapter shows how to build scatter plots and introduces two basic concepts of ggplot2 graphics: aesthetics and geoms.

The first scatter plot

Suppose you are interested in the relationship between the humidity and the viscosity in the bands dataset.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) + geom_point()
## Warning: Removed 6 rows containing missing values (geom_point).

The syntax is quite simple. The function ggplot() initializes the plot with the following arguments:

  • data is the data frame to use for the plot, in this case the bands data frame
  • mapping is the list of aesthetic mappings (links between variables and visual properties) to use for the plot, built with the aes() function. In this case the mapping is as follows: the x-axis displays the humidity variable and the y-axis displays the viscosity variable.

Notes:

  • the variable names must be unquoted
  • the function is named ggplot() while the package is named ggplot2

On its own, ggplot() does not draw any data: it creates a plot object with no layers. You have to add to ggplot() the geometric object (geom) you want to draw. A scatter plot is made of points, so you use the geom_point() function, here without any argument.

The meaning of these technical terms will become clearer in the following examples.

The code produces a warning message: in six observations humidity, viscosity or both are missing, so the corresponding points cannot be drawn. In the following examples, the warning message is not shown.
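The warning can be checked and, if desired, silenced. The sketch below assumes a small made-up data frame (toy) in place of bands, since the real data is not reproduced here. It counts the rows that cannot be plotted and uses the na.rm argument of geom_point() to drop them quietly.

```r
library(ggplot2)

# Hypothetical stand-in for bands: two numeric columns with missing values
toy <- data.frame(
  humidity  = c(60, 72, NA, 80, 65, NA),
  viscosity = c(45, NA, 50, 52, 47, 49)
)

# Rows where at least one of the two plotted variables is missing
n_incomplete <- sum(!complete.cases(toy[, c("humidity", "viscosity")]))
n_incomplete

# na.rm = TRUE removes the same rows, but without the warning
p_quiet <- ggplot(data = toy, mapping = aes(x = humidity, y = viscosity)) +
  geom_point(na.rm = TRUE)
p_quiet
```

Counting the incomplete rows first is a cheap way to confirm that the number in the warning matches what you expect before hiding it.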

Alternatively, you can call ggplot() without any argument and pass the same data and mapping to geom_point() instead.

ggplot() + geom_point(data=bands, mapping=aes(x=humidity, y=viscosity))

To understand the difference between the two examples above, you can add a new geom to the plot. The previous plot suggests a (weak) positive correlation between the variables, so a regression line is a natural addition. geom_smooth() with method="lm" adds a regression line based on a linear model.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) + geom_point() + geom_smooth(method="lm")

The plot now displays a regression line with its confidence interval.

Now try to add the same smoother, but starting from the code in which ggplot() has no arguments.

ggplot() + geom_point(data=bands, mapping=aes(x=humidity, y=viscosity)) + geom_smooth(method="lm")

There is no line. Why? The reason is quite simple: geom_smooth() does not know which data and which variables to use for the linear model. You can solve the issue as follows.

ggplot() + 
  geom_point(data=bands, mapping=aes(x=humidity, y=viscosity)) +
  geom_smooth(data=bands, mapping=aes(x=humidity, y=viscosity), method="lm")

To sum up, the data and aesthetics specified in ggplot() are used by default by all layers. Layers are like sheets of tracing paper: you can overlay them. Each layer contains a geometry. If a layer has its own data or aesthetics, these override the defaults for that layer only; aesthetics the layer does not specify are still inherited from ggplot().
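The inheritance rules can be seen in a short sketch. It assumes the built-in mtcars data as a stand-in for bands, with wt and mpg playing the roles of humidity and viscosity; the object name p_layers is only illustrative.

```r
library(ggplot2)

p_layers <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  # Layer 1: no arguments, so it inherits the data and both aesthetics
  geom_point() +
  # Layer 2: has its own data, but x and y are still inherited
  geom_point(data = subset(mtcars, cyl == 4), colour = "red", size = 3) +
  # Layer 3: inherits everything, so the line is fitted to all rows
  geom_smooth(method = "lm", formula = y ~ x)

p_layers

# layer_data() returns the data that each layer actually draws
nrow(layer_data(p_layers, 1))
nrow(layer_data(p_layers, 2))
```

The second layer draws only the four-cylinder cars, yet it needs no aes() of its own because x and y come from the defaults.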

Assigning ggplots to a variable

From a formal point of view, a ggplot is an R object like anything else; so you can assign it to a variable.

gp1 <- ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) + geom_point() 

Now you can display the plot by typing its name, gp1.

gp1

Once you have assigned the scatter plot to gp1, you can add a smoother to it:

gp2 <- gp1 + geom_smooth(method="lm")
gp2
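Adding a layer creates a new object and leaves the original one unchanged, so one base plot can be reused for several variants. A minimal sketch, assuming the built-in mtcars data in place of bands and illustrative object names:

```r
library(ggplot2)

base_plot <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

# Two variants built from the same starting point
with_line  <- base_plot + geom_smooth(method = "lm", formula = y ~ x)
with_loess <- base_plot + geom_smooth(method = "loess", formula = y ~ x)

# base_plot still has a single layer; each variant has two
length(base_plot$layers)
length(with_line$layers)
length(with_loess$layers)
```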

Changing the shape and size of points

At this point, you probably want to change the appearance of the points, such as their shape or size. To do so, set shape or size as an argument of geom_point().

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) + 
  geom_point(shape=2)

R graphics provide point shapes numbered from 0 to 25. Shapes 0 to 14 have just an outline, shapes 15 to 20 are solid, and shapes 21 to 25 have both an outline and a fill. The default shape for geom_point() is 19.
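A reference chart of the shape codes is easy to draw yourself. The sketch below is one possible layout, not the original figure: the data frame shapes and its grid columns gx and gy are made up for the purpose, and scale_shape_identity() tells ggplot2 to use the codes as they are.

```r
library(ggplot2)

# One row per shape code, laid out on a grid seven columns wide
shapes <- data.frame(code = 0:25)
shapes$gx <- shapes$code %% 7
shapes$gy <- -(shapes$code %/% 7)

shape_chart <- ggplot(data = shapes, mapping = aes(x = gx, y = gy)) +
  # fill is only visible for shapes 21 to 25
  geom_point(mapping = aes(shape = code), size = 5, fill = "red") +
  # print each code just below its symbol
  geom_text(mapping = aes(label = code), nudge_y = -0.35, size = 3) +
  scale_shape_identity() +
  theme_void()

shape_chart
```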

Shape can also be a string containing a single character.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) + 
  geom_point(shape="$", size=3)

Size is set in the same way.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) + 
  geom_point(size=5)

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) + 
  geom_point(shape=3, size=1)

Changing the colour of points

Colour is set in the same way as shape and size: pass colour as an argument of geom_point(). Both the UK spelling (colour) and the US spelling (color) are accepted.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) + 
  geom_point(colour="red")

Shapes 21 to 25 allow two colours. For these shapes, colour sets the outline and fill sets the interior colour.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) + 
  geom_point(shape=21, colour="red", fill="#FF0000")

When you need to pass a colour to R, you can use a string with the colour name. There are 657 colour names in R; type colours() to list them. Alternatively, you can pass a string with the hexadecimal code (e.g. "#FF0000"), or use rgb(), hsv() or hcl() if you are familiar with these colour models.
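The different colour specifications are interchangeable. As a small illustration using base R functions only, the following lines write pure red in four ways and check that they describe the same colour (the variable names are illustrative):

```r
# Four ways to write the same pure red
red_by_name <- "red"
red_by_hex  <- "#FF0000"
red_by_rgb  <- rgb(1, 0, 0)               # red, green, blue in [0, 1]
red_by_hsv  <- hsv(h = 0, s = 1, v = 1)   # hue, saturation, value in [0, 1]

# col2rgb() converts any colour specification to its RGB components
col2rgb(c(red_by_name, red_by_hex, red_by_rgb, red_by_hsv))

# Number of built-in colour names, and a few that contain "blue"
length(colours())
head(grep("blue", colours(), value = TRUE))
```

Any of these values can be passed to colour or fill in geom_point().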

The data used in these examples has 540 observations, but the plot seems to have fewer points. This is because many points overlap. Transparency is useful in such cases. The alpha aesthetic sets the transparency level: valid alpha values are numbers from 0 (fully transparent) to 1 (fully opaque).

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) +
  geom_point(alpha=0.25)

Since alpha=0.25 (and 0.25 is 1/4), a spot looks close to solid once about four points overlap.
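The four-point rule is best read as a rule of thumb. Assuming the graphics device blends overlapping points with standard "source-over" alpha compositing (an assumption about the device, not something stated above), n overlapping points with transparency alpha have a combined opacity of 1 - (1 - alpha)^n. The short calculation below shows how quickly this approaches 1.

```r
alpha <- 0.25
n_points <- 1:10

# Combined opacity of n overlapping points under source-over compositing
opacity <- 1 - (1 - alpha)^n_points

round(data.frame(n_points, opacity), 3)
```

Under this assumption four overlapping points are about 68% opaque, which already reads as dark on screen, and further overlaps add less and less.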

If you are familiar with base graphics in R, shape replaces pch, size replaces cex, colour replaces col and fill replaces bg. The base names are also accepted by ggplot2, but you are strongly advised to switch to the new, more intuitive names.
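To see the correspondence between the two naming schemes, the sketch below builds the same layer twice, once with the ggplot2 names and once with the base graphics names, and compares the result. It assumes the built-in mtcars data in place of bands; the object names are illustrative.

```r
library(ggplot2)

# ggplot2 argument names
p_gg <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(shape = 21, size = 3, colour = "red", fill = "yellow")

# Base graphics names for the same settings
p_base <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(pch = 21, cex = 3, col = "red", bg = "yellow")

# Compare the aesthetics that each layer will draw
cols <- c("shape", "size", "colour", "fill")
same_aesthetics <- isTRUE(all.equal(layer_data(p_gg, 1)[cols],
                                    layer_data(p_base, 1)[cols]))
same_aesthetics
```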

Mapping a third variable to scatter plots

Scatter plots are designed to visualize the relationship between two variables: one mapped to the x-axis and one mapped to the y-axis. Sometimes a third variable needs to be shown as well. In these cases, you can map it to another aesthetic: size, shape or colour.

Suppose you’re interested in how the relationship between humidity and viscosity depends on band_type (band present or absent). To do this, map band_type to colour.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity, colour=band_type)) +
  geom_point()

Note that mapping a variable to an aesthetic occurs within aes(), while setting an aesthetic to a fixed value occurs outside aes().
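A common slip is to put a fixed value inside aes(). The sketch below, which assumes the built-in mtcars data in place of bands, contrasts the two cases: in the second plot the string "red" is treated as a one-level variable, so the points take a colour from the default palette and a legend appears.

```r
library(ggplot2)

# Setting: colour is a fixed value, given outside aes()
p_set <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(colour = "red")

# Mapping a constant by mistake: "red" is handled as data, not as a colour
p_map <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(mapping = aes(colour = "red"))

# Colours actually used for drawing in each plot
colour_when_set    <- unique(layer_data(p_set, 1)$colour)
colour_when_mapped <- unique(layer_data(p_map, 1)$colour)
colour_when_set
colour_when_mapped
```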

The plot shows the same points as the previous ones, now in different colours, and a legend is added.

Alternatively, you can map band_type to another aesthetic, like shape.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity, shape=band_type)) +
  geom_point()

Different shapes are harder to tell apart than colours when there are many points, but this solution is a useful alternative when printing in black and white. You can improve the result by using both aesthetics together.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity, colour=band_type, shape=band_type)) +
  geom_point()

You can also map band_type to size, but the result argues against this choice.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity, size=band_type)) +
  geom_point()

If the third variable is continuous, you can map it to colour or to size. It makes no sense to map a continuous variable to shape (ggplot2 stops with an error if you try).

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity, colour=ink_pct)) +
  geom_point()

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity, size=ink_pct)) +
  geom_point()

Small differences in size and colour are harder to perceive than differences in position, so variables mapped to these aesthetics are read with much lower accuracy than those mapped to the spatial coordinates (x and y).

The following code produces a scatter plot of humidity versus viscosity with band_type mapped to colour, and adds a regression line.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity, colour=band_type)) +
  geom_point() + geom_smooth(method="lm")

The scatter plot has two regression lines. This is the expected result, because band_type is mapped to colour in the ggplot() function, whose mappings are used not only by geom_point() but also by geom_smooth().

To produce a scatter plot with a single regression line, the colour aesthetic must be mapped in geom_point() only.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) +
  geom_point(mapping=aes(colour=band_type)) + geom_smooth(method="lm")

Finally, this is the plot to use if you want three regression lines: one for all observations and one for each level of band_type.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) +
  geom_point(mapping=aes(colour=band_type)) + 
  geom_smooth(mapping=aes(colour=band_type), method="lm") +
  geom_smooth(method="lm")
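You can check how many separate fits each smoothing layer produces by inspecting its data. The sketch below assumes the built-in mtcars data in place of bands, with a two-level factor (here called gearbox, a made-up name derived from the am column) playing the role of band_type.

```r
library(ggplot2)

# Two-level factor as a stand-in for band_type
cars_df <- transform(mtcars,
                     gearbox = factor(am, labels = c("automatic", "manual")))

p_lines <- ggplot(data = cars_df, mapping = aes(x = wt, y = mpg)) +
  geom_point(mapping = aes(colour = gearbox)) +
  # colour mapped in this layer: one fit per level of gearbox
  geom_smooth(mapping = aes(colour = gearbox), method = "lm", formula = y ~ x) +
  # no colour mapping here: a single fit on all observations
  geom_smooth(method = "lm", formula = y ~ x)

p_lines

# Number of separate fits in each smoothing layer
fits_by_level <- length(unique(layer_data(p_lines, 2)$group))
fits_overall  <- length(unique(layer_data(p_lines, 3)$group))
fits_by_level
fits_overall
```

Mapping a discrete variable inside a layer splits that layer into groups, which is why the number of lines follows the number of levels.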

Mapping four variables to scatter plots

Although the result may be difficult to interpret, different variables can be mapped to different aesthetics at the same time. In theory, you can map as many variables as there are aesthetics, but mapping more than four variables is not recommended.

This is the result when you are interested in ink percentage and band type at the same time.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity, size=ink_pct, colour=band_type)) +
  geom_point()

When one variable is mapped to size, mapping another variable to shape is not recommended, because it is difficult to compare the sizes of different shapes.